library(tidyverse)
library(plotly)
library(shiny)
library(flexdashboard)
library(viridis)
data_clean <- read.csv("data_final.csv")
Project Name: Examining Demographic Patterns in NYC Shooting Data
Project Members: Chenhui Yan (CY2772), Zhaokun Lin (ZL3544), Mingyin Wang (AW3693), Zebang Zhang (ZZ3309)
In this exploratory study, we aim to analyze the geographic and temporal distribution of shootings in New York City, with an emphasis on how socioeconomic factors influence these patterns.
Understanding the dynamics of shooting incidents in New York City is crucial for enhancing public safety and fostering resilient communities. By examining demographic patterns in shooting crime data alongside datasets on high school graduation rates and poverty levels, our project aims to identify the socioeconomic factors that influence gun violence. Analyzing how disparities in education and income relate to shooting incidents provides valuable insights for targeted interventions.
Gun violence is a significant public health and social issue New York City, where diverse communities and complex socioeconomic conditions create a unique landscape for understanding its contributing factors. This project examines the geographic, temporal, and socioeconomic patterns of shootings in NYC to identify critical influences on gun violence. We explore how shooting incidents are distributed across boroughs and neighborhoods and examine temporal trends, such as seasonal, monthly, and time-of-day variations. Additionally, we investigate the relationship between education and gun violence, focusing on whether lower high school graduation rates correlate with higher shooting prevalence. Finally, we analyze the connection between poverty and gun violence to determine if neighborhoods with higher poverty levels experience increased incidents of shootings.
This research will aid in informing targeted public health and safety interventions by identifying the communities most impacted by shootings. Understanding these patterns will enable policymakers to implement more effective violence prevention measures and allocate resources where they are needed the most.
Our project focuses on examining the socioeconomic and temporal factors that influence gun violence in New York City. Specifically, we aim to answer the following questions:
Geographic and Temporal Patterns of Shootings in NYC: - How are shooting incidents distributed geographically across different boroughs and neighborhoods in New York City? - What are the temporal patterns of shootings (e.g., month, season, time of day) in different areas of NYC?
Education and Gun Violence: - Is there an association between high school graduation rates and the prevalence of shootings across NYC neighborhoods? - Are neighborhoods with lower high school graduation rates more likely to experience higher rates of gun violence?
Poverty and Gun Violence: - How does the percentage of people living below the poverty line relate to shooting incidents across neighborhoods in NYC? - Are higher poverty levels associated with an increased frequency of shootings?
This report analyzes shooting incidents in NYC. You can find the original data before merging here. This data is from NYCOpenData, consists some basic information: occur time, coordinate of each incident, victim’s information, .etc.
Neighborhood Tabulation Areas (NTAs) are medium-sized statistical geographies for reporting Decennial Census and American Community Survey. NTAs were delineated with the need for both geographic specificity and statistical reliability in mind. Shapefile of NTA is get from NYC Planning website, you can find the data here, or you can download
The Department of City Planning (DCP) created Community District Tabulation Areas (CDTAs) to closely approximate the 59 Community Districts of New York City for the purpose of reporting American Community Survey (ACS) data. You can find the data here, or you can download
The Borough Boundaries dataset from the NYC Open Data portal provides detailed information about the geographical boundaries of the five boroughs of New York City: Bronx, Brooklyn, Manhattan, Queens, and Staten Island. You can find the data here
Here is neighbourhood list for geo filter. Sourced from city or open source GIS files. Here is GeoJSON file of neighbourhoods of the city.
The data source for education is shown here.
The data source for poverty is here
The original data only consists of some basic information and doesn’t show NTA, CDTA, Borough and neighborhood information of each incident, so we added these information to the original data base on the coordinate of each incident, we also add some demographic information: population, education level, poverty level, .etc to the data.
Here’s how we added the information based on all the data we have:
df=read_csv("./data/NYPD_Shooting_Incident_Data__Historic__20241119.csv")
# Ensure coordinate columns are numeric
df = df %>%
mutate(
X_COORD_CD = as.numeric(X_COORD_CD),
Y_COORD_CD = as.numeric(Y_COORD_CD)
)
# Convert to sf object with CRS EPSG:2263
df_sf = st_as_sf(df, coords = c("X_COORD_CD", "Y_COORD_CD"), crs = 2263)
# Download and read the GeoJSON file
download.file(
url = "https://data.insideairbnb.com/united-states/ny/new-york-city/2024-09-04/visualisations/neighbourhoods.geojson",
destfile = "neighbourhoods.geojson",
mode = "wb"
)
neighbourhoods = st_read("neighbourhoods.geojson")
# Check and transform CRS of neighborhoods
if (st_crs(neighbourhoods)$epsg != 2263) {
neighbourhoods = st_transform(neighbourhoods, crs = 2263)
}
# Perform the spatial join
df_joined = st_join(df_sf, neighbourhoods, left = TRUE)
# Add neighborhood information to the data frame
df$Neighborhood = df_joined$neighbourhood
# Download and use the neighborhoods CSV file
download.file(
url = "https://data.insideairbnb.com/united-states/ny/new-york-city/2024-09-04/visualisations/neighbourhoods.csv",
destfile = "neighbourhoods.csv",
mode = "wb"
)
neighborhoods_csv = read_csv("neighbourhoods.csv")
# Merge additional attributes if necessary
df = df %>%
left_join(neighborhoods_csv, by = c("Neighborhood" = "neighbourhood"))df$X_COORD_CD <- as.numeric(df$X_COORD_CD)
df$Y_COORD_CD <- as.numeric(df$Y_COORD_CD)
# Convert to sf object
df_sf <- st_as_sf(df, coords = c("X_COORD_CD", "Y_COORD_CD"), crs = 2263)
# Read neighborhood shapefile
nta <- st_read("./data/nynta2020_24d/nynta2020.shp")
# Transform CRS if necessary
nta <- st_transform(nta, crs = st_crs(df_sf))
# Perform spatial join
df_joined <- st_join(df_sf, nta, left = TRUE)
# Add neighborhood information to your original data frame
df$NTA <- df_joined$NTAName # Adjust based on actual column name
## add 2010 nta data
df$X_COORD_CD <- as.numeric(df$X_COORD_CD)
df$Y_COORD_CD <- as.numeric(df$Y_COORD_CD)
# Convert to sf object
df_sf <- st_as_sf(df, coords = c("X_COORD_CD", "Y_COORD_CD"), crs = 2263)
# Read neighborhood shapefile
neighborhoods_2010 <- st_read("./data/nynta2010_24d/nynta2010.shp")
# Transform CRS if necessary
neighborhoods_2010 <- st_transform(neighborhoods_2010, crs = st_crs(df_sf))
# Perform spatial join
df_joined <- st_join(df_sf, neighborhoods_2010, left = TRUE)
# Add neighborhood information to your original data frame
df$NTA_2010 <- df_joined$NTAName # Adjust based on actual column nameholidayNYSE() function to determine which
dates are public holidays
df$OCCUR_DATE <- as.Date(df$OCCUR_DATE, format = "%m/%d/%Y")
# Define the range of years, handling NA values
years <- seq(
year(min(df$OCCUR_DATE, na.rm = TRUE)),
year(max(df$OCCUR_DATE, na.rm = TRUE)),
by = 1
)
us_holidays = holidayNYSE(years)
us_holidays = as.Date(us_holidays)
# Add a new column indicating whether the date is a holiday, year and month information of a date
df = df %>%
mutate(
Is_Holiday = OCCUR_DATE %in% us_holidays,
Year = year(OCCUR_DATE),
Month = month(OCCUR_DATE)
)latitude and
longitude), the getSunlightTimes() function
generates dawn and dusk times for each day
df$OCCUR_DATE <- as.character(df$OCCUR_DATE)
df$OCCUR_TIME <- as.character(df$OCCUR_TIME)
# Combine OCCUR_DATE and OCCUR_TIME into a single datetime string
datetime_str <- paste(df$OCCUR_DATE, df$OCCUR_TIME)
# Parse datetime using lubridate
df$OCCUR_DATETIME <- ymd_hms(datetime_str, tz = "America/New_York")
# Extract date from OCCUR_DATETIME
df$DATE <- as.Date(df$OCCUR_DATETIME)
# Get unique dates
unique_dates <- unique(df$DATE)
latitude <- 40.7128
longitude <- -74.0060
# Calculate dawn and dusk times
sun_times <- getSunlightTimes(
date = unique_dates,
lat = latitude,
lon = longitude,
keep = c("dawn", "dusk"),
tz = "America/New_York"
)
# Merge with the original dataframe
df <- df %>%
left_join(sun_times[, c("date", "dawn", "dusk")], by = c("DATE" = "date"))
# Determine if the sky is dark considering twilight
df <- df %>%
mutate(
Sky_Is_Dark = OCCUR_DATETIME < dawn | OCCUR_DATETIME >= dusk
)
df <- df %>%
select(-DATE, -dawn, -dusk)popu <- suppressWarnings(
read_excel("./data/nyc_detailed-race-and-ethnicity-data_2020_core-geographies.xlsx",
sheet = 1,
range = "A4:I2599",
col_types = c("numeric", "text", "text", "text", "numeric", "text", "text", "numeric", "numeric")) |>
filter(`Orig Order` >= 2334 & `Orig Order` <= 2595) |>
select(GeoName, NTAType, Pop) |>
rename(Total_population = Pop)
)
df <- df |>
left_join(popu, by = c("NTA" = "GeoName")) |>
mutate(
NTAType = case_when(
NTAType == 0 ~ "Residential",
NTAType == 9 ~ "Park",
NTAType == 8 ~ "Airport",
NTAType == 7 ~ "Cemetery",
NTAType == 6 ~ "Other Special Areas",
NTAType == 5 ~ "Rikers Island",
is.na(NTAType) ~ "Unknown")
)df$X_COORD_CD <- as.numeric(df$X_COORD_CD)
df$Y_COORD_CD <- as.numeric(df$Y_COORD_CD)
df_sf <- st_as_sf(df, coords = c("X_COORD_CD", "Y_COORD_CD"), crs = 2263)
cdta <- st_read("./data/nycdta2020_24d/nycdta2020.shp")
cdta <- st_transform(cdta, crs = st_crs(df_sf))
df_joined <- st_join(df_sf, cdta, left = TRUE)
df$CDTA <- df_joined$CDTA2020
df$CDTA <- gsub("([A-Z]{2})(\d{2})", "\1 \2", df$CDTA)NTA2020), selects specific
columns, and merges this poverty data with the shooting incident
dataset.
neighborhood_poverty <- read.csv("./data/neighborhood_poverty.csv")
# Select specific columns from neighborhood_poverty
neighborhood_poverty_selected <- neighborhood_poverty %>%
filter(TimePeriod == '2017-21') %>%
filter(GeoType == 'NTA2020' ) %>%
select(Number, Percent, Geography)
# Merge the dataframes on 'NTA' and 'Geography', keeping all information from data_processing
df_poverty <- left_join(df, neighborhood_poverty_selected, by = c("NTA" = "Geography"))
# Rename columns after merging
df_poverty <- df_poverty %>%
rename(Number_poverty = Number, Percent_poverty = Percent)
neighborhood_education = read.csv("./data/graduated_high_school.csv")
neighborhood_education_selected <- neighborhood_education %>%
filter(TimePeriod == '2017-21') %>%
filter(GeoType == 'NTA2020' ) %>%
select(Number, Percent, Geography)
# Merge the dataframes on 'NTA' and 'Geography', keeping all information from data_processing
df_education <- left_join(df_poverty, neighborhood_education_selected, by = c("NTA" = "Geography"))
# Rename columns after merging
df_education <- df_education %>%
rename(Number_education = Number, Percent_education = Percent)
# filter between 2017 and 2023
df_education$OCCUR_DATE <- as.Date(df_education$OCCUR_DATE, format = "%Y-%m-%d")
# Filter the incidents that happened between 2017 and 2023
df_education <- df_education %>%
filter(OCCUR_DATE >= as.Date("2017-01-01") & OCCUR_DATE <= as.Date("2023-12-31"))#add a column that shows the shooting incident rate of NTAs in that year
df_education <- df_education %>%
group_by(NTA, Year) %>%
mutate(incident_rate_by_year_nta = (n() / Total_population)*100) %>%
ungroup()
#add a column that shows the shooting incident rate of boroughs in that year
boro_population <- df_education %>%
group_by(BORO, Year) %>%
summarise(total_population_boro = sum(Total_population, na.rm = TRUE)) %>%
ungroup()
df_education <- df_education %>%
left_join(boro_population, by = c("BORO", "Year")) %>%
group_by(BORO, Year) %>%
mutate(incident_rate_by_year_boro = (n() / total_population_boro)*100) %>%
ungroup()
#Rename a column
df_education <- df_education %>%
rename(Total_population_nta = Total_population)
write.csv(df_education, "data_final.csv", row.names = FALSE)The detailed code is here
library(ggplot2)
library(readxl)
library(readr)
library(tidyverse)
library(dplyr)
library(sf)
library(plotly)
library(geojsonio)
library(knitr)
library(tidyr)
# Load the dataset
df_descriptive=read_csv("data_final.csv")
data_final <- read_csv("data_final.csv")
# Grouping the data by year to count the number of incidents
df_descriptive$Year <- as.factor(df_descriptive$Year)
year_counts <- as.data.frame(table(df_descriptive$Year))
colnames(year_counts) <- c('Year', 'Count')
# Plotting the line chart for shooting incidents by year with data points labeled
ggplot(year_counts, aes(x = Year, y = Count, group = 1)) +
geom_line(color = 'blue') +
geom_point(size = 3) +
geom_text(aes(label = Count), vjust = -0.5, size = 3) +
labs(title = 'Shooting Incidents by Year in NYC',
x = 'Year',
y = 'Number of Shooting Incidents') +
theme_minimal()

The chart shows a stable trend in shooting incidents from 2017 to 2019, followed by a sharp increase in 2020, likely linked to socio-economic factors like the COVID-19 pandemic. Incidents peaked in 2021 and then declined through 2023.
# Bar chart showing the number of TRUE and FALSE in the Sky_Is_Dark variable
sky_dark_counts <- df_descriptive %>%
group_by(Sky_Is_Dark) %>%
summarise(total_incidents = n())
ggplot(sky_dark_counts, aes(x = Sky_Is_Dark, y = total_incidents, fill = Sky_Is_Dark)) +
geom_bar(stat = 'identity') +
labs(title = 'Number of Shooting Incidents by Sky Condition (Dark vs. Bright)',
x = 'Sky Condition (Dark vs. Bright)',
y = 'Number of Shooting Incidents') +
theme_minimal()
The number of shooting incidents when the sky was dark is significantly
higher than when it was bright. This might indicate that shootings are
more likely to occur during nighttime or low visibility conditions. The
higher number of incidents during dark conditions could be due to
factors such as reduced visibility, higher activity at night, or fewer
people around, making it easier for incidents to occur undetected
# Table showing Top 10 total incident NTA in Each Borough
incident_by_nta_borough <- df_descriptive %>%
drop_na() %>%
group_by(BORO, NTA) %>%
summarise(total_incidents = n()) %>%
arrange(BORO, desc(total_incidents)) %>%
group_by(BORO) %>%
slice_max(n = 10, order_by = total_incidents) %>%
select(-total_incidents) %>%
group_by(BORO) %>%
mutate(row_num = row_number()) %>%
pivot_wider(names_from = BORO, values_from = NTA) %>%
unnest(cols = c(BRONX, BROOKLYN, MANHATTAN, QUEENS, `STATEN ISLAND`)) %>%
select(row_num, everything())%>%
slice(1:10)
# Display the table in the desired format
kable(incident_by_nta_borough, caption = "Top 10 NTAs with Total Shooting Incidents in Each Borough (2017-2023)")
| row_num | BRONX | BROOKLYN | MANHATTAN | QUEENS | STATEN ISLAND |
|---|---|---|---|---|---|
| 1 | Mott Haven-Port Morris | Brownsville | East Harlem (North) | Far Rockaway-Bayswater | St. George-New Brighton |
| 2 | Mount Hope | Bedford-Stuyvesant (East) | Harlem (North) | Baisley Park | Tompkinsville-Stapleton-Clifton-Fox Hills |
| 3 | Tremont | Crown Heights (North) | East Harlem (South) | Richmond Hill | West New Brighton-Silver Lake-Grymes Hill |
| 4 | Mount Eden-Claremont (West) | East New York-New Lots | Harlem (South) | Jamaica | Mariner’s Harbor-Arlington-Graniteville |
| 5 | Concourse-Concourse Village | East Flatbush-Remsen Village | Washington Heights (South) | Rockaway Beach-Arverne-Edgemere | Annadale-Huguenot-Prince’s Bay-Woodrow |
| 6 | Williamsbridge-Olinville | East New York (North) | Chelsea-Hudson Yards | St. Albans | Port Richmond |
| 7 | Melrose | Bedford-Stuyvesant (West) | Hamilton Heights-Sugar Hill | South Ozone Park | Rosebank-Shore Acres-Park Hill |
| 8 | Belmont | Canarsie | Inwood | Elmhurst | Grasmere-Arrochar-South Beach-Dongan Hills |
| 9 | Longwood | Coney Island-Sea Gate | Chinatown-Two Bridges | South Jamaica | Great Kills-Eltingville |
| 10 | Fordham Heights | East New York-City Line | Washington Heights (North) | Astoria (East)-Woodside (North) | New Dorp-Midland Beach |
This table allows for comparison of which neighborhoods within each borough have experienced the highest shooting incidents, and helps identify the differences in levels of gun violence among the boroughs of New York City.
# Plot: Description of Victims
victim_summary <- df_descriptive %>%
filter(VIC_SEX %in% c("M", "F")) %>% # Filter out unknown
group_by(VIC_SEX, VIC_AGE_GROUP, VIC_RACE) %>%
summarise(total_victims = n()) %>%
drop_na()
# Plotting the description of male victims
ggplot(victim_summary %>% filter(VIC_SEX == "M"), aes(x = VIC_AGE_GROUP, y = total_victims, fill = VIC_RACE)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(title = "Description of Male Shooting Victims by Age Group and Race",
x = "Victim Age Group",
y = "Number of Victims",
fill = "Victim Race") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Plotting the description of female victims
ggplot(victim_summary %>% filter(VIC_SEX == "F"), aes(x = VIC_AGE_GROUP, y = total_victims, fill = VIC_RACE)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(title = "Description of Female Shooting Victims by Age Group and Race",
x = "Victim Age Group",
y = "Number of Victims",
fill = "Victim Race") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar charts show the distribution of male and female shooting victims by age group and race in NYC. In both charts, Black victims are the most affected across all age groups, with the highest number of victims in the 18-24 and 25-44 age groups. For males, the 25-44 age group shows a notable peak, while for females, the same age group also has the highest numbers. White Hispanic victims also show considerable numbers, particularly in the 25-44 age group.
# Plot: Bar chart showing victim's age
ggplot(df_descriptive %>% filter(!is.na(VIC_AGE_GROUP)& VIC_AGE_GROUP != 1022), aes(x = VIC_AGE_GROUP)) +
geom_bar(fill = "steelblue") +
labs(title = "Bar Chart of Victim's Age",
x = "Victim Age Group",
y = "Count of Victims") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

The 25-44 age group has the highest number of victims, significantly more than any other group. The 18-24 age group also shows a considerable number of victims, while the <18 and 45-64 age groups have notably fewer victims. The 65+ age group has the least number of victims.
# Count the boro incident
boro_incident_counts <- data_final %>%
group_by(BORO) %>%
summarise(Number_of_Incidents = n(), .groups = "drop") %>%
mutate(BORO = tolower(BORO) )
# Lowercase the boro in boro_shape
boro_shape = boro_shape %>%
mutate(boro_name = tolower(boro_name))
# Merge spatial data with incident counts
boro_map_data <- boro_shape %>%
left_join(boro_incident_counts, by = c("boro_name" = "BORO"))
data_final <- read_csv("data_final.csv")
boro_map_data <- boro_map_data %>%
mutate(
hover_text = paste("Borough:", boro_name, "<br>Total Incidents:", Number_of_Incidents)
)
# Create the interactive plot with click functionality
plot <- plot_ly(
data = boro_map_data,
type = "scattermapbox",
split = ~boro_name, # Separate polygons by boroughs
color = ~Number_of_Incidents, # Color based on the number of incidents
colors = "viridis", # Use a color scale
text = ~hover_text, # Display hover text
hoverinfo = "text",
marker = list(size = 8, opacity = 0.7)
) %>%
layout(
title = "Total Number of Incidents Across NYC BOROs (2017-2023)",
mapbox = list(
style = "carto-positron", # Base map style
center = list(lon = -74.0060, lat = 40.7128), # Center map on NYC
zoom = 10
)
)
# Add click functionality to display the borough name and number of incidents
plot <- plot %>%
event_register("plotly_click") %>%
htmlwidgets::onRender("
function(el, x) {
el.on('plotly_click', function(d) {
var point = d.points[0];
var text = point.text;
alert('You clicked on: ' + text);
});
}
")
# Display the interactive plot
plot
<<<<<<< HEAD
=======